Some more prefiltering
Pre-filtering genes
It is customary to remove genes that have very low expression, since their signal-to-noise ratio is very small to yield any biologically significant results.
Keep in the dds object only genes with at least 10 reads in more than 25% of the samples
Useful functions:
#> [1] 20180
nSamples <- ncol(dds)*.25
genes.keep <- rowSums(counts(dds) >=10) > nSamples
dds <- dds[genes.keep,]
nrow(dds)
#> [1] 17399
We went from about 20,200 genes down to about 17,400.
Sample characterization
Now we want to explore the samples in our dataset. In order to compare the columns of our matrix, however, we need to normalize the counts. The simplest way would be to take the sum of each column (i.e., the total library size of each sample) and divide the counts by the value. The DESeq2 package offers some more sophisticated methods, and we will use the “variance stabilising transformation” in this tutorial (the vst() function). The output is a DESeqTransform object, and the values calculated can be extracted from it using the assay() function:
norm_expr <- vst(dds) %>% assay
Execute the command above to obtain a matrix of normalized gene expression
norm_expr <- vst(dds) %>% assay
Before we plot the expression values we just calculated, let’s install and load the pheatmap package, used to plot heatmaps.
Now, instead of plotting the whole normalized expression matrix that we obtained using the vst() function, let us only plot the top 1,000 most varying genes using the pheatmap() function.
Calculate the standard deviation for each gene and pick the top 1,000 genes with the highest. Plot the vst expression matrix for those genes using the pheatmap() function.
Useful functions:
apply()
sd()
rowSds()
arrange()
desc()
slice_max()
pheatmap()
plot.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
slice_max(sd, n=1000) %>%
pull(gene_name)
norm_expr[plot.genes, ] %>% pheatmap()

The heatmap is not that informative to compare across samples, because it only shows that some genes have a higher level of expression than others. By default, pheatmap() does not scale the data, but we can do that using scale=“row” or scale=“column”.
Plot the heatmap again but this time using scale=“row”
norm_expr[plot.genes, ] %>%
pheatmap(scale="row")

What could the two clusters be representing? We can annotate the columns of the heatmap by using the option “annotation_col” of pheatmap(). We need to provide a data.frame with the variables to annotate. For example:
annotation_col=as.data.frame(colData(dds)[,c("condition","origin")])
Plot the same heatmap but adding annotations for the condition, the origin, the RIN values and the post-mortem time
norm_expr[plot.genes, ] %>%
pheatmap(scale="row", annotation_col=as.data.frame(colData(dds)[,c("condition","origin", "rin", "pm_time_min")]))

The fact that the two cohorts are so different is a reason for concern. We know that the cohorts have very different RIN values and post-mortem intervals, and gene expression may reflect those differences.
Pairwise correlation between samples
We are going to calculate how different each pair of samples are, and then summarize each sample by its median difference to the rest of the samples.
The cor() function calculates the correlation between each pair of columns of the input matrix, and we can take advantage of it.
Use the cor() function on the subset of 1,000 most varying genes from the vst transformed gene expression data to calculate all pairwise correlations and plot the resulting correlation matrix using pheatmap().
Useful functions:
Cors <- cor(norm_expr[plot.genes, ])
pheatmap(Cors)

Calculate the median correlation per sample.
Useful functions:
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
#> SL283585 SL283597 SL283590 SL283575 SL283616 SL283618 SL283570 SL283581
#> 0.7567623 0.7572315 0.7606560 0.7871291 0.8066868 0.8169079 0.8392480 0.8453424
#> SL283566 SL283584 SL283576 SL283610 SL283599 SL283595 SL283596 SL283614
#> 0.8495857 0.8548754 0.8627927 0.8654529 0.8686231 0.8698628 0.8741854 0.8835448
#> SL283587 SL283583 SL283617 SL283589 SL283592 SL283565 SL283605 SL283608
#> 0.8841093 0.8846946 0.8890795 0.8904316 0.8909524 0.8911854 0.8920549 0.8924803
#> SL283586 SL283598 SL283600 SL283604 SL283603 SL283568 SL283593 SL283569
#> 0.8941451 0.8952702 0.8963120 0.8989816 0.9008564 0.9013622 0.9015567 0.9016811
#> SL283601 SL283612 SL283572 SL283574 SL283606 SL283602 SL283578 SL283580
#> 0.9024367 0.9046887 0.9079857 0.9086827 0.9087302 0.9100107 0.9100342 0.9122101
#> SL283591 SL283588 SL283579 SL283567 SL283577 SL283573 SL283594 SL283582
#> 0.9133564 0.9152114 0.9152738 0.9155428 0.9178124 0.9234158 0.9240418 0.9274764
#> SL283571
#> 0.9275740
A straightforward way to display outliers in this case is to plot a boxplot with ggplot2, which by default plots outliers as points:
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))
Plot a ggplot2 boxplot using the command above.
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

We can see that four samples (SL283585, SL283597, SL283590, SL283575) are shown as outliers - they are quite different from all the rest.
Now we will plot all the samples in the first 2 principal components of a PCA. The easiest is probably to use the plotPCA() function from the DESeq2 package, which works on the result from vst(). Note that you have to run vst() again on the dds object, since plotPCA() works on its output.
Useful functions:
Use the plotPCA() function to visualize the samples in the first 2 principal components of the expression data.

Do the same, but for each cohort (origin) separately.
plotPCA(vst(dds)[,colData(dds)$origin=="PW"])

plotPCA(vst(dds)[,colData(dds)$origin=="NBB"])

Can you spot the two clusters in the NBB cohort?
The plotPCA() function from the DESeq2 package does something sneaky behind the scenes. If you check the help for the function, you’ll see that the argument “ntop” is set to 500 by default. To save time, the function carries out the PCA only on the top 500 most varying genes, similarly to what we were doing before (but using the top 1,000). What happens if we use, for example, only the top 100 instead?
Use plotPCA() with ntop=400, then ntop=200, and finally with ntop=10 for all samples.
plotPCA(vst(dds), ntop=400)

plotPCA(vst(dds), ntop=200)

plotPCA(vst(dds), ntop=10)

The fewer genes we use to construct the PCA, the more obvious the clusters become. What is going on? What could it be that separates our samples in two groups? Let’s have a look at this top 10 most varying genes.
First, load the gene descriptions that we used in the first tutorial using
geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds"))
Load the geneNames.Rds file into the R session and convert it to a tibble
geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds")) %>%
as_tibble
Then, find the gene names of the top10 most varying genes in the geneNames object.
Find the top10 most varying genes in the geneNames object.
tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
slice_max(sd, n=10) %>%
left_join(geneNames) %>%
select(gene_name, seqnames) %>%
unique
In which chromosome are most of these genes located? Can it explain the groups you obseved in the PCA?
Plot the PCA again using the plotPCA() function using ntop=10 and intgroup=“sex”.
plotPCA(vst(dds), ntop=10, intgroup="sex")

Although sex-associated gene expression is not exclusively restricted to sex chromosomes, it is of a much greater magnitude in the sex chromosomes.
Check sex assignment
We can use the strong association between sex-chromosome genes and sex to assess potential mis-labeling of samples. A quick way is to use all genes located in the Y chromosome to run a PCA.
Plot all the samples in the two principal components of the PCA of the Y-located genes, colouring the points by the sex of the sample.
NOTE: use the vst() function on the whole dataset, subset after.
Ygenes <- geneNames %>% filter(seqnames == "Y", gene_biotype == "protein_coding") %>%
pull(gene_name) %>% unique
vst(dds)[rownames(dds) %in% Ygenes,] %>% plotPCA(ntop=Inf, intgroup="sex")

As you can see, most of the variance is explained by the first principal component, as expected. It would in fact be enough to plot the values for the PC1.
correct_sign <- function(x) {
if (median(x)<0) return(-1*x)
else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>%
assay %>%
t %>%
prcomp
pca1$x %>%
as_tibble(rownames="sample_id") %>%
select(sample_id, PC1) %>%
mutate(PC1=correct_sign(PC1)) %>%
left_join(as_tibble(colData(dds))) %>%
ggplot(aes(x=sex, y=PC1)) +
geom_boxplot(outlier.shape=NA) +
geom_jitter(aes(colour=sex), height=0, width=.2)
correct_sign <- function(x) {
if (median(x)<0) return(-1*x)
else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>%
assay %>%
t %>%
prcomp
pca1$x %>%
as_tibble(rownames="sample_id") %>%
select(sample_id, PC1) %>%
mutate(PC1=correct_sign(PC1)) %>%
left_join(as_tibble(colData(dds))) %>%
ggplot(aes(x=sex, y=PC1)) +
geom_boxplot(outlier.shape=NA) +
geom_jitter(aes(colour=sex), height=0, width=.2)

For simplicity, we are going to remove from the dds object all genes mapped to the sex chromosomes.
Remove all X and Y genes from the dds object.
sex.genes <- geneNames %>% filter(seqnames %in% c("X","Y")) %>%
pull(gene_name) %>% unique
dim(dds)
#> [1] 17399 49
dds <- dds[!rownames(dds) %in% sex.genes,]
dim(dds)
#> [1] 16755 49
Recalculate the pairwise correlations between samples as before with the filtered dataset. Plot the values and plot the PCA with ntop=1000
norm_expr <- vst(dds) %>% assay
top1000.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
slice_max(sd, n=1000) %>%
pull(gene_name)
Cors <- cor(norm_expr[top1000.genes, ])
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
#> SL283597 SL283585 SL283590 SL283575 SL283616 SL283618 SL283581 SL283570
#> 0.7627503 0.7682360 0.7782123 0.7977671 0.8229228 0.8239435 0.8447489 0.8559964
#> SL283566 SL283584 SL283576 SL283599 SL283610 SL283596 SL283595 SL283583
#> 0.8576488 0.8708081 0.8778189 0.8798999 0.8831353 0.8856723 0.8859760 0.8899975
#> SL283614 SL283592 SL283587 SL283617 SL283565 SL283603 SL283608 SL283589
#> 0.8914506 0.8977170 0.8985736 0.9024982 0.9032124 0.9049152 0.9060812 0.9098299
#> SL283605 SL283568 SL283604 SL283593 SL283586 SL283598 SL283600 SL283612
#> 0.9100418 0.9114346 0.9129014 0.9129787 0.9129787 0.9136635 0.9140187 0.9146281
#> SL283602 SL283578 SL283601 SL283574 SL283606 SL283572 SL283569 SL283591
#> 0.9175766 0.9194735 0.9209084 0.9209463 0.9216333 0.9224393 0.9234645 0.9243526
#> SL283580 SL283567 SL283579 SL283588 SL283594 SL283577 SL283573 SL283582
#> 0.9244825 0.9249829 0.9256202 0.9308943 0.9318754 0.9344447 0.9352384 0.9364574
#> SL283571
#> 0.9429330
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))


In order to plot the sample ids, we need to use the “returnData=TRUE” option, in the plotPCA() function which, instead of plotting, will then returns the data in a format that can be easily used with ggplot2, adding a layer with the labels:
ggplot(data) +
geom_text(aes(label=name), nudge_y=1.5, size=3)
We could also, of course, run the PCA ourselves using prcomp().
plotPCA(vst(dds), returnData=TRUE) %>%
ggplot(aes(x=PC1, y=PC2)) +
geom_point(aes(colour=condition)) +
geom_text(aes(label=name), nudge_y=1.5, size=3)

Evaluating the PCA plot and the heatmap of correlations, do you think it is justified to remove any of the samples?
Cell type estimation
Before proceeding with the differential expression analysis, since we are dealing with human brain tissue, we need to investigate what the cell composition is in our samples. This is of fundamental importance in these type of datasets because most of the variability in gene expression in explained by the cell types.
We will use a very simple approach to estimate cell types similar, in a way, to what we used to “estimate” sex: summarize gene expression of a set of selected markers as their first principal component (PC1). The marker genes will be known cell type-specific genes.
Let’s load a list of markers of cortical cell types from Neuroespresso:
cortical.markers <- read_tsv("./Data/cortical_markers.tsv")
Read the tab-separated file “cortical_markers.tsv” into a tibble
cortical.markers <- read_tsv("./Data/cortical_markers.tsv")
Now let us run a PCA (as we have done until now) but subsetting the genes to the neuronal markers. We will use the prcomp() function and extract the values of the PC1 for each sample.
Run a PCA using prcomp() only on neuronal markers. Explore the result and find where the coordinates for the first PC are.
estimate_ct <- function(object, genes, ct="PC1", rescale=TRUE) {
object.vst <- vst(object)
pca <- object.vst[rownames(object.vst) %in% genes,] %>%
assay() %>%
t() %>%
prcomp()
result <- as_tibble(pca$x, rownames="sample_id") %>%
select(sample_id, PC1) %>%
mutate(PC1=correct_sign(PC1))
colnames(result) <- c("sample_id", ct)
if (rescale) result[,ct] <- scale(result[,ct])[,1]
return(result)
}
neuronal.est <- estimate_ct(dds, cortical.markers %>% filter(cell_type=="Neuron") %>% pull(gene_name), ct="Neuron")
Now we can do that for every other cortical cell type.
Calculate the first principal component for the other cell types and add them all to the metadatata of the dds object (colData(dds)).
ct.est <- lapply(unique(cortical.markers$cell_type), function(ct){
message(paste0(ct, "..."))
Genes <- filter(cortical.markers, cell_type==ct) %>% pull(gene_name)
estimate_ct(dds, genes=Genes, ct=ct)
}) %>% Reduce("left_join", .)
newColData <- left_join(as_tibble(colData(dds)), ct.est, by="sample_id") %>%
as("DataFrame")
rownames(newColData) <- newColData$sample_id
all(rownames(colData(dds)) == rownames(newColData))
#> [1] TRUE
colData(dds) <- newColData
And finally, do the same with the RNA markers for the cortical synapsome, which are not quite the same as the set of neuronal markers. These are RNAs found in the synapses, obtained from Hafner et al. 2019.
synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")
#syn <- qusage::read.gmt("./Data/synaptosome_symbols.gmt") %>% qdapTools::list2df() %>% as_tibble
#colnames(syn) <- c("gene_id", "cell_type")
#write_tsv(syn, "./Data/synaptosome_markers.tsv")
synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")
syn.est <- estimate_ct(dds, genes=unique(synaptosome.markers$gene_id), ct="synaptosome")
all(syn.est$sample_id == colData(dds)$sample_id)
#> [1] TRUE
colData(dds)$Synapses <- syn.est$synaptosome
colData(dds)
#> DataFrame with 49 rows and 16 columns
#> sample_id origin age_years sex pm_time_min condition
#> <character> <character> <numeric> <character> <numeric> <factor>
#> SL283595 SL283595 PW 79 M 2880 Control
#> SL283593 SL283593 PW 85 F 4320 Control
#> SL283618 SL283618 PW 88 F 3000 Control
#> SL283601 SL283601 PW 65 M 2880 Control
#> SL283565 SL283565 NBB 85 F 425 Control
#> ... ... ... ... ... ... ...
#> SL283592 SL283592 PW 69 M 3420 Case
#> SL283588 SL283588 PW 72 M 2880 Case
#> SL283600 SL283600 PW 78 F 1440 Case
#> SL283594 SL283594 PW 82 M 5040 Case
#> SL283590 SL283590 PW 95 M 2880 Case
#> rin Batch sample_id_paper Astrocyte Endothelial Microglia
#> <numeric> <character> <character> <numeric> <numeric> <numeric>
#> SL283595 3.4 2 Ctr-1 -1.184423 -1.160479 0.955535
#> SL283593 5.9 2 Ctr-10 -0.309941 0.537182 0.469155
#> SL283618 5.6 4 Ctr-11 0.217133 -1.489182 -2.429756
#> SL283601 4.9 3 Ctr-14 -0.498226 0.213325 0.744386
#> SL283565 6.6 3 Ctr-23 0.153747 0.319363 -0.260804
#> ... ... ... ... ... ... ...
#> SL283592 4.5 2 PD-5 -0.8294955 -0.938985 0.201632
#> SL283588 5.6 4 PD-6 -0.0199344 0.681234 0.483412
#> SL283600 6.2 3 PD-7 -1.0950377 -0.521545 1.167569
#> SL283594 6.1 1 PD-8 0.6990216 0.729143 0.121045
#> SL283590 5.9 2 PD-9 -2.7878270 -2.481439 1.771681
#> Oligo OligoPrecursors Neuron Synapses
#> <numeric> <numeric> <numeric> <numeric>
#> SL283595 0.782419 -0.6619144 -0.705792 -0.902350
#> SL283593 0.532277 0.0857936 0.630432 0.461680
#> SL283618 -0.160397 -1.1894146 -0.866282 -1.078199
#> SL283601 0.244944 -0.3442277 -0.264953 -0.349270
#> SL283565 -2.747786 -0.1931905 0.123844 0.269135
#> ... ... ... ... ...
#> SL283592 -1.5972658 -1.140781 -0.4162497 -0.652125
#> SL283588 0.0917733 -0.213461 0.0183912 -0.155062
#> SL283600 -1.1012333 -0.734881 -0.6989150 -0.817925
#> SL283594 -0.2983306 0.625531 0.8981723 0.721039
#> SL283590 -0.1372109 -1.721758 -2.5800732 -2.100692
How can we assess which variables are responsible for most of the variation in gene expression? PCA is very helpful in this regards. By definition, the first principal component will maximize the variance that it explains by combining linearly the variables. In a way, it is “summarizing” gene expression by optimally projecting it into one dimension. Then, with the leftover variance not explained by PC1, it will do the same, but with the constaint that PC2 needs to be perpendicular to PC1.
The first principal components are often referred to as the main lines of variation, because they gather as much as possible, and often serve as a good “summary” of the multidimensional data.
If we want to assess whether cell type composition (or other variables) explain most of the variation, a simple way could be to test for association between variables and PCs. This is a bit involved programmatically, so here is the code:
scale.vec <- function(x) scale(x)[,1]
pca2 <- dds %>%
vst %>%
assay %>%
t %>%
prcomp
Metadata.pca <- pca2$x %>%
as_tibble(rownames="sample_id") %>%
select(sample_id, PC1:PC5) %>%
mutate_if(is.numeric, scale.vec) %>%
left_join(as_tibble(colData(dds)))
Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.integer)
allCors <- lapply(paste0("PC", 1:5), function(pc) {
lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
#message(paste0(pc, " vs ", var))
lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
broom::tidy() %>% filter(term!="(Intercept)") %>%
mutate(PC=pc, VAR=var)
}) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)
Pvals <- allCors %>% select(p.value, PC, VAR) %>%
pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC
Estimates <- allCors %>% select(estimate, PC, VAR) %>%
pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC
corrplot::corrplot(e.mat, p.mat=p.mat)
scale.vec <- function(x) scale(x)[,1]
pca2 <- dds %>%
vst %>%
assay %>%
t %>%
prcomp
Metadata.pca <- pca2$x %>%
as_tibble(rownames="sample_id") %>%
select(sample_id, PC1:PC5) %>%
mutate_if(is.numeric, scale.vec) %>%
left_join(as_tibble(colData(dds)))
Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.integer)
allCors <- lapply(paste0("PC", 1:5), function(pc) {
lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
#message(paste0(pc, " vs ", var))
lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
broom::tidy() %>% filter(term!="(Intercept)") %>%
mutate(PC=pc, VAR=var)
}) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)
Pvals <- allCors %>% select(p.value, PC, VAR) %>%
pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC
Estimates <- allCors %>% select(estimate, PC, VAR) %>%
pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC
corrplot::corrplot(e.mat, p.mat=p.mat)

Do you see anything interesting? May it be that the RNA quality (RIN) is associated with cell type composition? That can easily be tested with a quick linear regression: try predicting the RIN values with the estimated values for synaptosomal content, then try the same with the estimated neuronal content, and then with both at the same time in the model. The formulas are as follow:
- rin ~ Synapses
- rin ~ Neuron
- rin ~ Synapses + Neuron
Then compare the resulting models using the summary() function on the linear fits.
Run 3 linear regressions with rin as a response variable and 1. Synapses; 2. Neuron; and 3. Synapses + Neuron as predictors. Check the adjusted R-squared values to compare the models.
# RIN
lm.syn <- lm(rin~Synapses, data=as.data.frame(colData(dds)))
lm.neu <- lm(rin~Neuron, data=as.data.frame(colData(dds)))
lm.both <- lm(rin~Neuron+Synapses, data=as.data.frame(colData(dds)))
lm.both.cohort <- lm(rin~Neuron+Synapses+origin, data=as.data.frame(colData(dds)))
lm.syn %>% summary
#>
#> Call:
#> lm(formula = rin ~ Synapses, data = as.data.frame(colData(dds)))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.5931 -0.7244 0.0062 0.6431 2.4892
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.9490 0.1348 44.141 < 2e-16 ***
#> Synapses 1.2082 0.1362 8.873 1.31e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9434 on 47 degrees of freedom
#> Multiple R-squared: 0.6262, Adjusted R-squared: 0.6182
#> F-statistic: 78.73 on 1 and 47 DF, p-value: 1.31e-11
#>
#> Call:
#> lm(formula = rin ~ Neuron, data = as.data.frame(colData(dds)))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.88488 -0.93143 -0.09251 0.55718 2.74936
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.9490 0.1552 38.343 < 2e-16 ***
#> Neuron 1.0846 0.1568 6.919 1.09e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.086 on 47 degrees of freedom
#> Multiple R-squared: 0.5046, Adjusted R-squared: 0.494
#> F-statistic: 47.87 on 1 and 47 DF, p-value: 1.085e-08
#>
#> Call:
#> lm(formula = rin ~ Neuron + Synapses, data = as.data.frame(colData(dds)))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.76501 -0.49485 -0.05657 0.66050 1.64005
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.9490 0.1230 48.378 < 2e-16 ***
#> Neuron -1.7596 0.5442 -3.234 0.00226 **
#> Synapses 2.9213 0.5442 5.369 2.53e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8608 on 46 degrees of freedom
#> Multiple R-squared: 0.6954, Adjusted R-squared: 0.6822
#> F-statistic: 52.51 on 2 and 46 DF, p-value: 1.334e-12
lm.both.cohort %>% summary
#>
#> Call:
#> lm(formula = rin ~ Neuron + Synapses + origin, data = as.data.frame(colData(dds)))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.68523 -0.54675 -0.01054 0.52954 1.62786
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6.0734 0.2152 28.222 < 2e-16 ***
#> Neuron -1.5875 0.5989 -2.651 0.011 *
#> Synapses 2.7036 0.6280 4.305 8.9e-05 ***
#> originPW -0.2177 0.3082 -0.706 0.484
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8655 on 45 degrees of freedom
#> Multiple R-squared: 0.6988, Adjusted R-squared: 0.6787
#> F-statistic: 34.79 on 3 and 45 DF, p-value: 8.653e-12
Synaptosome expression seem to explain RIN the best. But what if we consider “origin” as well? Which is the “best” model? Check the adjusted R squared for the models or run anova() on two models to test whether a more complex model is significantly better at capturing RIN variation.
Add “origin” as a covariate to the previous models and compare the adjusted R squared between the simpler and more comples versions of the model. Even better, use the anova() function to compare the models.
anova(
lm(rin~Synapses, data=as.data.frame(colData(dds))),
lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
)
anova(
lm(rin~Synapses, data=as.data.frame(colData(dds))),
lm(rin~Synapses+origin, data=as.data.frame(colData(dds)))
)
anova(
lm(rin~Neuron, data=as.data.frame(colData(dds))),
lm(rin~Neuron+origin, data=as.data.frame(colData(dds)))
)
anova(
lm(rin~Synapses+origin, data=as.data.frame(colData(dds))),
lm(rin~Synapses+Neuron+origin, data=as.data.frame(colData(dds)))
)
lm.final <- lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
You can also check the diagnostic plots by simply using the function plot() on the linear fit. Do you see a sample that you may want to exclude before running your model again? Does it make a difference?




Differential gene expression
Running an differential expression analysis
With the DESeq2 package, we can use the function DESeq() on our dds object to run the whole differential expression pipeline, which will run three steps in order:
estimateSizeFactors(dds): estimation of sample-specific normalization parameters
estimateDispersions(dds): estimation of gene-specific dispersion parameters
nbinomWaldTest(dds): negative binomial generalized linear model to calculate the desired log2-fold changes and calculation of Wald statistics
When you run DESeq, by default it will calculate the fit according to whatever model you had specified when you created the object. The model’s formula can be printed on screen using design(dds).
Let’s change the model’s formula of the object like so:
design(dds) <- ~ sex + age_years + condition
and then run DESeq() on the dds object.
Note 1: if you have manually changed something in the colData(dds), it may be that some columns are of type “character”. They should be converted to factors.
Note 2: to avoid some headaches later on, we want to be very explicit with our variable of interest (the disease status) by specifying which of the two levels (“case” or “control”) is the REFERENCE value. This can be done by using the relevel() function like so:
colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")
Run a differential expression analysis on the dds object using the DESeq() function. The model’s formula should be ~ sex + age + condition. Overwrite with the returned object the dds object.
Useful functions:
mutate_if()
is.character()
DESeq()
# convert to char to factor
chr_cols <- sapply(colData(dds), is.character)
colData(dds)[chr_cols] <- lapply(colData(dds)[chr_cols], as.factor)
# relevel condition just in case
colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")
design(dds) <- ~ sex + age_years + condition
dds <- DESeq(dds)
Looking at the results
The results() function extracts the log2 fold changes and p values from the dds object (as long as the DESeq function was run). By default, the function will extract the log2 fold change for the last variable in the design formula (in our case, the disease status). However, it is always good practice to explicitly specify the contrast when calling results() with the “contrast” or “name” argument. There are different ways of specifying contrasts but, for example, if we wanted to extract the differential expression results between males and females, we could use:
results(dds, name="sex_M_vs_F")
or
results(dds, contrast=c("sex", "M", "F"))
they are equivalent. A nice trick to see which contrasts can be extracted using “name=”, you can use
resultsNames(dds)
which prints all possible values for the contrast name. In our case, the possible values for the “name=” argument are: “Intercept”, “sex_M_vs_F”, “age_years”, and “condition_Case_vs_Control”
Use the results() function on the dds object to extract the differential expression between cases and controls and find out whether any genes are significant after correction.
#results(dds, name="sex_M_vs_F")
#results(dds, contrast=c("sex", "M", "F"))
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
Are there any significant changes in gene expression associated with the disease?
We have seen that RIN is associated with the main lines of variation in gene expression. We should expect, then, that RIN is adding a lot of “uninteresting” variation to the dataset, i.e., a lot of noise. What happens if you add “rin” as a covariate?
You’ll need to update the formula and re-run the whole pipeline like so:
design(dds) <- ~ sex + age_years + rin + condition
dds <- DESeq(dds)
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
Re-run the analyses with RIN as a covariate.
dds_rin <- dds
design(dds_rin) <- ~ sex + age_years + rin + condition
dds_rin <- DESeq(dds_rin)
res_rin <- results(dds_rin, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res_rin %>% arrange(padj)
What do you think has changed? How does the extra covariate added influence the results? What do you think would have happened if you added to the model a covariate that was highly associated with the condition of interest (i.e., “confounded”)?
Since the cohorts had different RINs and post-mortem intervals, try to run the cohorts separately and explore the results.
Run the differential expression analyses separately for each cohort.
dds_pw <- dds[,colData(dds)$origin=="PW"]
dds_pw <- DESeq(dds_pw)
results(dds_pw) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
dds_nbb <- dds[,colData(dds)$origin=="NBB"]
dds_nbb <- DESeq(dds_nbb)
results(dds_nbb) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
How can two cohorts exhibit such differences in their results? What does this suggest about inter-study replicability?
Gene set enrichment analyse
We are going to use the fgsea package to run a gene set enrichment analysis on your results.
Install the fgsea package using Bioconductor::install(fgsea) and load the package.
We can download “gmt” formatted files from MSigDB website, which can be imported into R using the gmtPathways() function from the fgsea package. I have already downloaded the KEGG database, which can be loaded into a variable like so:
KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")
Load the KEGG pathway list using the previous command.
KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")
We are going to prepare an input for the fgsea() function. We need to rank the genes according to our value of interest (in our case, it will be the “stat” column of our differential expression results, although you could use -log10(p-value) or log2FoldChange).
Create a vector with the “stat” values from the results and name the values with the corresponding gene names.
ranks <- res$stat
names(ranks) <- res$gene_name
Run gsea_res <- fgsea(KEGG_pathways, ranks) and explore the results (“ranks” is the vector of differential expression stats).
gsea_res <- fgsea(KEGG_pathways, ranks) %>% arrange(padj)
gsea_res
What is at the top? Can you run the GSEA with the results from the two alternate models (i.e., with and withouth accounting for the effect of RIN)?
Create a new vector with the stats of the appropriate results and run fgsea again.
ranks_rin <- res_rin$stat
names(ranks_rin) <- res_rin$gene_name
gsea_res_rin <- fgsea(KEGG_pathways, ranks_rin) %>% arrange(padj)
gsea_res_rin
What happens if you add Neuron or Synapses to the design model? How that does affect the GSEA results?
Run another differential expression analysis using Neuron and/or Synapses as covariates and explore the differences in the enriched pathways.
dds_neu <- dds
design(dds_neu) <- ~ sex + age_years + rin + Neuron + Synapses + condition
dds_neu <- DESeq(dds_neu)
res_neu <- results(dds_neu) %>% as_tibble(rownames="gene_name")
ranks_neu <- res_neu$stat
names(ranks_neu) <- res_neu$gene_name
gsea_res_neu <- fgsea(KEGG_pathways, ranks_neu) %>% arrange(padj)
gsea_res_neu
Can you think of a way to identify the pathways that are more susceptible to RIN and/or Neuron correction?
---
title: "Part 2: RNA-seq analysis"
author: "Gonzalo S. Nido"
date: "`r Sys.Date()`"
output:
    html_notebook:
        toc: true
        toc_float: true
        number_sections: true
        highlight: tango
---

<!--
To convert Markdown document to HTML,
R -e "rmarkdown::render('2.Rmd')"

To spell-check, spelling::spell_check_files()
-->

```{css, echo=FALSE, cache=FALSE}
q { color: Black; font-weight: bold; background-color: khaki; }
```


```{r global_options, include=FALSE, cache=FALSE}
require("knitr")
knitr::opts_chunk$set(fig.path='Figs_Rmd/', fig.height=4, fig.width=3,
                      warning=FALSE, message=FALSE,
                      echo=TRUE,
                      #echo=FALSE,
                      eval=TRUE,
                      #eval=FALSE,
                      results=TRUE,
                      #results=FALSE,
                      cache=FALSE,
                      comment="#>")
require("kableExtra")
```









# Packages

As usual, we will need to load some packages.


<q>Load the `tidyverse` and `DESeq2` packages, install them if they are not
already.</q>


```{r}
require("tidyverse")
require("DESeq2")
```

Now, load the DESeq2 object that you saved previously using the `readRDS`
function.

<q>Load the Rds object containing the DESeq2 object into the "dds" variable</q>

```{r}
dds <- readRDS("./dds.Rds")
```

While we already applied some filtering the to count matrix that we used as an
input, we can still pre-filter the dataset further. You can, in fact, treat the
DESeqDataSet object (in our `dds` variable) as a matrix to subset it.

For example, `dim(dds)` will return the dimensions of the matrix (number of
features or genes, and number of samples). Similarly, we can subset to the
first 3 samples and first 10 genes with `dds[1:10,1:3]`.

To access the full count matrix, you can use `counts(dds)`, and to access the
sample metadata, you use `colData(dds)`.

# Some more prefiltering

## Pre-filtering genes

It is customary to remove genes that have very low expression, since their
signal-to-noise ratio is very small to yield any biologically significant
results.

<q>Keep in the `dds` object only genes with at least 10 reads in more than 25%
of the samples</q>

Useful functions:

* `rowSums()`

```{r}
nrow(dds)
nSamples <- ncol(dds)*.25
genes.keep <- rowSums(counts(dds) >=10) > nSamples
dds <- dds[genes.keep,]
nrow(dds)
```

We went from about 20,200 genes down to about 17,400.


## Sample characterization

Now we want to explore the samples in our dataset. In order to compare the
columns of our matrix, however, we need to normalize the counts. The simplest
way would be to take the sum of each column (i.e., the total library size of
each sample) and divide the counts by the value. The `DESeq2` package offers
some more sophisticated methods, and we will use the "variance stabilising
transformation" in this tutorial (the `vst()` function).  The output is a
`DESeqTransform` object, and the values calculated can be extracted from it
using the `assay()` function:

    norm_expr <- vst(dds) %>% assay

<q>Execute the command above to obtain a matrix of normalized gene
expression</q>

```{r}
norm_expr <- vst(dds) %>% assay
```

Before we plot the expression values we just calculated, let's install and load
the `pheatmap` package, used to plot heatmaps.

```{r}
require("pheatmap")
```

Now, instead of plotting the whole normalized expression matrix that we
obtained using the `vst()` function, let us only plot the top 1,000 most
varying genes using the `pheatmap()` function.

<q>Calculate the standard deviation for each gene and pick the top 1,000 genes
with the highest. Plot the vst expression matrix for those genes using the
`pheatmap()` function.</q>

Useful functions:

* `apply()`
* `sd()`
* `rowSds()`
* `arrange()`
* `desc()`
* `slice_max()`
* `pheatmap()`

```{r, fig.height=12, fig.width=10}
plot.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
norm_expr[plot.genes, ] %>% pheatmap()
```

The heatmap is not that informative to compare across samples, because it only
shows that some genes have a higher level of expression than others. By
default, `pheatmap()` does not scale the data, but we can do that using
scale="row" or scale="column".

<q>Plot the heatmap again but this time using scale="row"</q>

```{r, fig.height=12, fig.width=10}
norm_expr[plot.genes, ] %>%
    pheatmap(scale="row")
```

What could the two clusters be representing? We can annotate the columns of the
heatmap by using the option "annotation_col" of `pheatmap()`. We need to
provide a `data.frame` with the variables to annotate. For example:

    annotation_col=as.data.frame(colData(dds)[,c("condition","origin")])

<q> Plot the same heatmap but adding annotations for the condition, the origin,
the RIN values and the post-mortem time</q>

```{r, fig.height=10, fig.width=10}
norm_expr[plot.genes, ] %>%
    pheatmap(scale="row", annotation_col=as.data.frame(colData(dds)[,c("condition","origin", "rin", "pm_time_min")]))
```

The fact that the two cohorts are so different is a reason for concern. We know
that  the cohorts have very different RIN values and post-mortem intervals, and
gene expression may reflect those differences.


### Pairwise correlation between samples

We are going to calculate how different *each pair* of samples are, and then
summarize each sample by its median difference to the rest of the samples.

The `cor()` function calculates the correlation between each pair of columns of
the input matrix, and we can take advantage of it.

<q>Use the `cor()` function on the subset of 1,000 most varying genes from the
vst transformed gene expression data to calculate all pairwise correlations and
plot the resulting correlation matrix using `pheatmap()`.</q>

Useful functions:

* `cor()`
* `pheatmap()`

```{r, fig.height=10, fig.width=10}
Cors <- cor(norm_expr[plot.genes, ])
pheatmap(Cors)
```

<q>Calculate the median correlation per sample.</q>

Useful functions:

* `rowMedians()`
* `sort()`


```{r}
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
```

A straightforward way to display outliers in this case is to plot a boxplot with
ggplot2, which by default plots outliers as points:

    ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))

<q>Plot a ggplot2 boxplot using the command above.</q>

```{r}
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))
```

We can see that four samples (SL283585, SL283597, SL283590, SL283575) are shown
as outliers - they are quite different from all the rest.

Now we will plot all the samples in the first 2 principal components of a PCA.
The easiest is probably to use the `plotPCA()` function from the `DESeq2`
package, which works on the result from `vst()`. Note that you have to run
`vst()` again on the `dds` object, since `plotPCA()` works on its output.

Useful functions:

* `vst()`
* `plotPCA()`

<q>Use the `plotPCA()` function to visualize the samples in the first 2
principal components of the expression data.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds))
```

<q>Do the same, but for each cohort (origin) separately.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds)[,colData(dds)$origin=="PW"])
plotPCA(vst(dds)[,colData(dds)$origin=="NBB"])
```

Can you spot the two clusters in the NBB cohort?

The `plotPCA()` function from the `DESeq2` package does something sneaky behind
the scenes. If you check the help for the function, you'll see that the
argument "ntop" is set to 500 by default. To save time, the function carries
out the PCA only on the top 500 most varying genes, similarly to what we were
doing before (but using the top 1,000). What happens if we use, for example,
only the top 100 instead?

<q>Use `plotPCA()` with ntop=400, then ntop=200, and finally with ntop=10 for
all samples.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds), ntop=400)
plotPCA(vst(dds), ntop=200)
plotPCA(vst(dds), ntop=10)
```

The fewer genes we use to construct the PCA, the more obvious the clusters
become. What is going on? What could it be that separates our samples in two
groups? Let's have a look at this top 10 most varying genes.

First, load the gene descriptions that we used in the first tutorial using

    geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds"))

<q>Load the geneNames.Rds file into the R session and convert it to a
tibble</q>

```{r}
geneNames <- readRDS(url("https://git.app.uib.no/neuromics/cell-composition-rna-pd/-/raw/master/Data/EnsDb.Hsapiens.v75.Rds")) %>%
    as_tibble
```

Then, find the gene names of the top10 most varying genes in the `geneNames`
object.

<q>Find the top10 most varying genes in the `geneNames` object.</q>
    
```{r}
tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=10) %>%
    left_join(geneNames) %>%
    select(gene_name, seqnames) %>%
    unique
```

In which chromosome are most of these genes located? Can it explain the groups
you obseved in the PCA?

<q>Plot the PCA again using the `plotPCA()` function using *ntop=10* and
*intgroup="sex"*.</q>

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds), ntop=10, intgroup="sex")
```

Although sex-associated gene expression is not exclusively restricted to sex
chromosomes, it is of a much greater magnitude in the sex chromosomes.

## Check sex assignment

We can use the strong association between sex-chromosome genes and sex to assess
potential mis-labeling of samples. A quick way is to use all genes located in
the Y chromosome to run a PCA.

<q>Plot all the samples in the two principal components of the PCA of the
Y-located genes, colouring the points by the sex of the sample.</q> NOTE: use
the `vst()` function on the whole dataset, subset after.

```{r, fig.height=6, fig.width=5}
Ygenes <- geneNames %>% filter(seqnames == "Y", gene_biotype == "protein_coding") %>%
    pull(gene_name) %>% unique
vst(dds)[rownames(dds) %in% Ygenes,] %>% plotPCA(ntop=Inf, intgroup="sex")
```

As you can see, most of the variance is explained by the first principal
component, as expected. It would in fact be enough to plot the values for the
PC1.

```
correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)
```

```{r}
correct_sign <- function(x) {
    if (median(x)<0) return(-1*x)
    else return(x)
}
pca1 <- vst(dds)[rownames(dds) %in% Ygenes,] %>% 
    assay %>%
    t %>%
    prcomp
pca1$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1) %>%
    mutate(PC1=correct_sign(PC1)) %>%
    left_join(as_tibble(colData(dds))) %>%
    ggplot(aes(x=sex, y=PC1)) +
    geom_boxplot(outlier.shape=NA) +
    geom_jitter(aes(colour=sex), height=0, width=.2)
```

For simplicity, we are going to remove from the `dds` object all genes mapped
to the sex chromosomes.

<q>Remove all X and Y genes from the `dds` object.</q>

```{r}
sex.genes <- geneNames %>% filter(seqnames %in% c("X","Y")) %>%
    pull(gene_name) %>% unique
dim(dds)
dds <- dds[!rownames(dds) %in% sex.genes,]
dim(dds)
```

<q>Recalculate the pairwise correlations between samples as before with the
filtered dataset. Plot the values and plot the PCA with ntop=1000</q>

```{r}
norm_expr <- vst(dds) %>% assay
top1000.genes <- tibble(gene_name=rownames(norm_expr), sd=norm_expr %>% rowSds ) %>%
    slice_max(sd, n=1000) %>%
    pull(gene_name)
Cors <- cor(norm_expr[top1000.genes, ])
medCors <- rowMedians(Cors)
names(medCors) <- rownames(Cors)
medCors %>% sort
```

```{r}
ggplot(tibble(medCors)) + geom_boxplot(aes(x=1, y=medCors))
```

```{r, fig.height=6, fig.width=5}
plotPCA(vst(dds))
```

In order to plot the sample ids, we need to use the "returnData=TRUE" option,
in the `plotPCA()` function which, instead of plotting, will then returns the
data in a format that can be easily used with ggplot2, adding a layer with
the labels:

    ggplot(data) +
    geom_text(aes(label=name), nudge_y=1.5, size=3)

We could also, of course, run the PCA ourselves using `prcomp()`.

```{r, fig.height=5, fig.width=8}
plotPCA(vst(dds), returnData=TRUE) %>%
  ggplot(aes(x=PC1, y=PC2)) + 
  geom_point(aes(colour=condition)) +
  geom_text(aes(label=name), nudge_y=1.5, size=3)
```

Evaluating the PCA plot and the heatmap of correlations, do you think it is
justified to remove any of the samples?






# Cell type estimation

Before proceeding with the differential expression analysis, since we are
dealing with human brain tissue, we need to investigate what the cell
composition is in our samples. This is of fundamental importance in these type
of datasets because **most of the variability in gene expression in explained
by the cell types**.

We will use a very simple approach to estimate cell types similar, in a way, to
what we used to "estimate" sex: summarize gene expression of a set of selected
markers as their first principal component (PC1). The marker genes will be
known cell type-specific genes.

Let's load a list of markers of cortical cell types from
[Neuroespresso](https://www.eneuro.org/content/4/6/ENEURO.0212-17.2017):

    cortical.markers <- read_tsv("./Data/cortical_markers.tsv")

<q>Read the tab-separated file "cortical_markers.tsv" into a tibble </q>

```{r}
cortical.markers <- read_tsv("./Data/cortical_markers.tsv")
```

Now let us run a PCA (as we have done until now) but subsetting the genes to
the neuronal markers. We will use the `prcomp()` function and extract the
values of the PC1 for each sample.

<q>Run a PCA using `prcomp()` only on neuronal markers. Explore the result and
find where the coordinates for the first PC are.</q>

```{r}
estimate_ct <- function(object, genes, ct="PC1", rescale=TRUE) {
    object.vst <- vst(object)
    pca <- object.vst[rownames(object.vst) %in% genes,] %>%
        assay() %>%
        t() %>%
        prcomp()
    result <- as_tibble(pca$x, rownames="sample_id") %>%
        select(sample_id, PC1) %>%
        mutate(PC1=correct_sign(PC1))
    colnames(result) <- c("sample_id", ct)
    if (rescale) result[,ct] <- scale(result[,ct])[,1]
    return(result)
}
neuronal.est <- estimate_ct(dds, cortical.markers %>% filter(cell_type=="Neuron") %>% pull(gene_name), ct="Neuron")
```

Now we can do that for every other cortical cell type.

<q>Calculate the first principal component for the other cell types and add
them all to the metadatata of the dds object (`colData(dds)`).</q>

```{r}
ct.est <- lapply(unique(cortical.markers$cell_type), function(ct){
    message(paste0(ct, "..."))
    Genes <- filter(cortical.markers, cell_type==ct) %>% pull(gene_name)
    estimate_ct(dds, genes=Genes, ct=ct)
}) %>% Reduce("left_join", .)

newColData <- left_join(as_tibble(colData(dds)), ct.est, by="sample_id") %>%
    as("DataFrame")
rownames(newColData) <- newColData$sample_id
all(rownames(colData(dds)) == rownames(newColData))
colData(dds) <- newColData
```

And finally, do the same with the RNA markers for the cortical synapsome, which
are not quite the same as the set of neuronal markers. These are RNAs found in
the synapses, obtained from [Hafner et al.
2019](https://pubmed.ncbi.nlm.nih.gov/31097639/).

    synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")
    
```{r}
#syn <- qusage::read.gmt("./Data/synaptosome_symbols.gmt") %>% qdapTools::list2df() %>% as_tibble
#colnames(syn) <- c("gene_id", "cell_type")
#write_tsv(syn, "./Data/synaptosome_markers.tsv")
synaptosome.markers <- read_tsv("./Data/synaptosome_markers.tsv")

syn.est <- estimate_ct(dds, genes=unique(synaptosome.markers$gene_id), ct="synaptosome")
all(syn.est$sample_id == colData(dds)$sample_id)
colData(dds)$Synapses <- syn.est$synaptosome
colData(dds)
```

How can we assess which variables are responsible for most of the variation in
gene expression? PCA is very helpful in this regards. By definition, the first
principal component will maximize the variance that it explains by combining
linearly the variables. In a way, it is "summarizing" gene expression by
optimally projecting it into one dimension. Then, with the leftover variance
not explained by PC1, it will do the same, but with the constaint that PC2
needs to be perpendicular to PC1.

The first principal components are often referred to as the main lines of
variation, because they gather as much as possible, and often serve as a good
"summary" of the multidimensional data.

If we want to assess whether cell type composition (or other variables) explain
most of the variation, a simple way could be to test for association between
variables and PCs. This is a bit involved programmatically, so here is the
code:

```
scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)
```

```{r, fig.height=4, fig.width=4}
scale.vec <- function(x) scale(x)[,1]

pca2 <- dds %>% 
    vst %>%
    assay %>%
    t %>%
    prcomp

Metadata.pca <- pca2$x %>%
    as_tibble(rownames="sample_id") %>%
    select(sample_id, PC1:PC5) %>%
    mutate_if(is.numeric, scale.vec) %>%
    left_join(as_tibble(colData(dds)))

Metadata.pca <- Metadata.pca %>% select(-sample_id, -origin, -sample_id_paper) %>% mutate_if(is.character, as.factor) %>%
    mutate_if(is.factor, as.integer)

allCors <- lapply(paste0("PC", 1:5), function(pc) {
    lapply(colnames(Metadata.pca)[6:ncol(Metadata.pca)], function(var){
        #message(paste0(pc, " vs ", var))
        lm(formula=as.formula(paste0(pc, "~", var)), data=Metadata.pca) %>%
            broom::tidy() %>% filter(term!="(Intercept)") %>%
            mutate(PC=pc, VAR=var)
    }) %>% Reduce("bind_rows",.)
}) %>% Reduce("bind_rows",.)

Pvals <- allCors %>% select(p.value, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=p.value)
p.mat <- as.matrix(Pvals[,-1])
rownames(p.mat) <- Pvals$PC

Estimates <- allCors %>% select(estimate, PC, VAR) %>%
    pivot_wider(names_from=VAR, values_from=estimate)
e.mat <- as.matrix(Estimates[,-1])
rownames(e.mat) <- Estimates$PC

corrplot::corrplot(e.mat, p.mat=p.mat)
```

Do you see anything interesting? May it be that the RNA quality (RIN) is
associated with cell type composition? That can easily be tested with a quick
linear regression: try predicting the RIN values with the estimated values for
synaptosomal content, then try the same with the estimated neuronal content,
and then with both at the same time in the model. The formulas are as follow:

* rin ~ Synapses
* rin ~ Neuron
* rin ~ Synapses + Neuron

Then compare the resulting models using the `summary()` function on the linear
fits.

<q>Run 3 linear regressions with rin as a response variable and 1. Synapses; 2.
Neuron; and 3. Synapses + Neuron as predictors. Check the adjusted R-squared
values to compare the models.</q>

```{r}
# RIN
lm.syn <- lm(rin~Synapses, data=as.data.frame(colData(dds)))
lm.neu <- lm(rin~Neuron, data=as.data.frame(colData(dds)))
lm.both <- lm(rin~Neuron+Synapses, data=as.data.frame(colData(dds)))
lm.both.cohort <- lm(rin~Neuron+Synapses+origin, data=as.data.frame(colData(dds)))
lm.syn %>% summary
lm.neu %>% summary
lm.both %>% summary
lm.both.cohort %>% summary
```

Synaptosome expression seem to explain RIN the best. But what if we consider
"origin" as well? Which is the "best" model? Check the adjusted R squared for
the models or run `anova()` on two models to test whether a more complex model
is significantly better at capturing RIN variation.

<q>Add "origin" as a covariate to the previous models and compare the adjusted
R squared between the simpler and more comples versions of the model. Even
better, use the `anova()` function to compare the models.</q>

```{r}
anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses, data=as.data.frame(colData(dds))), 
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Neuron, data=as.data.frame(colData(dds))),
    lm(rin~Neuron+origin, data=as.data.frame(colData(dds)))
)
anova(
    lm(rin~Synapses+origin, data=as.data.frame(colData(dds))),
    lm(rin~Synapses+Neuron+origin, data=as.data.frame(colData(dds)))
)
lm.final <- lm(rin~Synapses+Neuron, data=as.data.frame(colData(dds)))
```

You can also check the diagnostic plots by simply using the function `plot()`
on the linear fit. Do you see a sample that you may want to exclude before
running your model again? Does it make a difference?

```{r, fig.height=4, fig.width=5}
plot(lm.final)
```




# Differential gene expression

## Running an differential expression analysis

With the `DESeq2` package, we can use the function `DESeq()` on our `dds`
object to run the whole differential expression pipeline, which will run three
steps in order:

1. `estimateSizeFactors(dds)`: estimation of sample-specific normalization parameters
2. `estimateDispersions(dds)`: estimation of gene-specific dispersion parameters
3. `nbinomWaldTest(dds)`: negative binomial generalized linear model to calculate the desired log2-fold changes and calculation of Wald statistics

When you run `DESeq`, by default it will calculate the fit according to
whatever model you had specified when you created the object. The model's
formula can be printed on screen using `design(dds)`.

Let's change the model's formula of the object like so:

    design(dds) <- ~ sex + age_years + condition

and then run `DESeq()` on the `dds` object.

**Note 1:** if you have manually changed something in the `colData(dds)`, it may
be that some columns are of type "character". They should be converted to
factors.

**Note 2:** to avoid some headaches later on, we want to be very explicit with
our variable of interest (the disease status) by specifying which of the two
levels ("case" or "control") is the REFERENCE value. This can be done by using
the `relevel()` function like so:

    colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

<q>Run a differential expression analysis on the `dds` object using the
`DESeq()` function. The model's formula should be __~ sex + age + condition__.
Overwrite with the returned object the `dds` object.</q>

Useful functions:

* `mutate_if()`
* `is.character()`
* `DESeq()`

```{r}
# convert to char to factor
chr_cols <- sapply(colData(dds), is.character)
colData(dds)[chr_cols] <- lapply(colData(dds)[chr_cols], as.factor)

# relevel condition just in case
colData(dds)$condition <- relevel(colData(dds)$condition, ref="Control")

design(dds) <- ~ sex + age_years + condition
dds <- DESeq(dds)
```

## Looking at the results

The `results()` function extracts the log2 fold changes and p values from the
`dds` object (as long as the `DESeq` function was run). By default, the
function will extract the log2 fold change **for the last variable in the
design formula** (in our case, the disease status). However, it is always good
practice to explicitly specify the contrast when calling `results()` with the
"contrast" or "name" argument. There are different ways of specifying contrasts
but, for example, if we wanted to extract the differential expression results
between males and females, we could use:

    results(dds, name="sex_M_vs_F")

or

    results(dds, contrast=c("sex", "M", "F"))

they are equivalent. A nice trick to see which contrasts can be extracted using
"name=", you can use

    resultsNames(dds)

which prints all possible values for the contrast name. In our case, the
possible values for the "name=" argument are: **"Intercept", "sex_M_vs_F",
"age_years", and "condition_Case_vs_Control"**

<q>Use the `results()` function on the `dds` object to extract the differential
expression between cases and controls and find out whether any genes are
significant after correction.</q>

```{r}
#results(dds, name="sex_M_vs_F")
#results(dds, contrast=c("sex", "M", "F"))
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
```

Are there any significant changes in gene expression associated with the
disease? 

We have seen that RIN is associated with the main lines of variation in gene
expression. We should expect, then, that RIN is adding a lot of "uninteresting"
variation to the dataset, i.e., a lot of noise. What happens if you add "rin"
as a covariate?

You'll need to update the formula and re-run the whole pipeline like so:

```
design(dds) <- ~ sex + age_years + rin + condition
dds <- DESeq(dds)
res <- results(dds, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res %>% arrange(padj)
```

<q>Re-run the analyses with RIN as a covariate.</q>

```{r}
dds_rin <- dds
design(dds_rin) <- ~ sex + age_years + rin + condition
dds_rin <- DESeq(dds_rin)
res_rin <- results(dds_rin, name="condition_Case_vs_Control") %>% as_tibble(rownames="gene_name")
res_rin %>% arrange(padj)
```

What do you think has changed? How does the extra covariate added influence the
results? What do you think would have happened if you added to the model a
covariate that was highly associated with the condition of interest (i.e.,
"confounded")?


Since the cohorts had different RINs and post-mortem intervals, try to run the
cohorts separately and explore the results.

<q>Run the differential expression analyses separately for each cohort.</q>

```{r}
dds_pw <- dds[,colData(dds)$origin=="PW"]
dds_pw <- DESeq(dds_pw)
results(dds_pw) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
dds_nbb <- dds[,colData(dds)$origin=="NBB"]
dds_nbb <- DESeq(dds_nbb)
results(dds_nbb) %>% as_tibble(rownames="gene_name") %>% arrange(padj)
```

How can two cohorts exhibit such differences in their results? What does this
suggest about inter-study replicability?



# Gene set enrichment analyse

We are going to use the `fgsea` package to run a gene set enrichment analysis
on your results.

<q>Install the `fgsea` package using `Bioconductor::install(fgsea)` and load
the package.</q>

```{r}
require('fgsea')
```

We can download "gmt" formatted files from
[MSigDB](http://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) website, which
can be imported into R using the `gmtPathways()` function from the `fgsea`
package. I have already downloaded the KEGG database, which can be loaded into
a variable like so:

    KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")

<q>Load the KEGG pathway list using the previous command.</q>

```{r}
KEGG_pathways <- gmtPathways("c2.cp.kegg.v2022.1.Hs.symbols.gmt")
```

We are going to prepare an input for the `fgsea()` function. We need to rank
the genes according to our value of interest (in our case, it will be the
"stat" column of our differential expression results, although you could use
-log10(p-value) or log2FoldChange).

<q>Create a vector with the "stat" values from the results and name the values
with the corresponding gene names.<q>

```{r}
ranks <- res$stat
names(ranks) <- res$gene_name
```

<q>Run `gsea_res <- fgsea(KEGG_pathways, ranks)` and explore the results
("ranks" is the vector of differential expression stats).</q>

```{r}
gsea_res <- fgsea(KEGG_pathways, ranks) %>% arrange(padj)
gsea_res
```

What is at the top? Can you run the GSEA with the results from the two
alternate models (i.e., with and withouth accounting for the effect of RIN)?

<q>Create a new vector with the stats of the appropriate results and run
`fgsea` again.<q>

```{r}
ranks_rin <- res_rin$stat
names(ranks_rin) <- res_rin$gene_name
gsea_res_rin <- fgsea(KEGG_pathways, ranks_rin) %>% arrange(padj)
gsea_res_rin
```

What happens if you add Neuron or Synapses to the design model? How that does
affect the GSEA results?

<q>Run another differential expression analysis using Neuron and/or Synapses as
covariates and explore the differences in the enriched pathways.<q>

```{r}
dds_neu <- dds
design(dds_neu) <- ~ sex + age_years + rin + Neuron + Synapses + condition
dds_neu <- DESeq(dds_neu)
res_neu <- results(dds_neu) %>% as_tibble(rownames="gene_name")
ranks_neu <- res_neu$stat
names(ranks_neu) <- res_neu$gene_name
gsea_res_neu <- fgsea(KEGG_pathways, ranks_neu) %>% arrange(padj)
gsea_res_neu
```


Can you think of a way to identify the pathways that are more susceptible to
RIN and/or Neuron correction?


